 level 4



Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Yi, Zihao, Jiang, Qingxuan, Ma, Ruotian, Chen, Xingyu, Yang, Qu, Wang, Mengru, Ye, Fanghua, Shen, Ying, Tu, Zhaopeng, Li, Xiaolong, Linus

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.


AI Agents for Photonic Integrated Circuit Design Automation

Sharma, Ankita, Fu, YuQi, Ansari, Vahid, Iyer, Rishabh, Kuang, Fiona, Mistry, Kashish, Aishy, Raisa Islam, Ahmad, Sara, Matres, Joaquin, Englund, Dirk R., Poon, Joyce K. S.

arXiv.org Artificial Intelligence

We present Photonics Intelligent Design and Optimization (PhIDO), a multi-agent framework that converts natural-language photonic integrated circuit (PIC) design requests into layout mask files. We compare 7 reasoning large language models for PhIDO using a testbench of 102 design descriptions that ranged from single devices to 112-component PICs. The success rate for single-device designs was up to 91%. For design queries with less than or equal to 15 components, o1, Gemini-2.5-pro, and Claude Opus 4 achieved the highest end-to-end pass@5 success rates of approximately 57%, with Gemini-2.5-pro requiring the fewest output tokens and lowest cost. The next steps toward autonomous PIC development include standardized knowledge representations, expanded datasets, extended verification, and robotic automation.
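For readers unfamiliar with the pass@5 metric quoted above, the following is a minimal sketch of one common way to estimate pass@k from repeated attempts. The abstract does not say how PhIDO computes it, so the unbiased combinatorial estimator used here, the function names `pass_at_k` and `mean_pass_at_k`, and the example counts are all assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one design query.

    n: number of attempts sampled for the query
    c: number of those attempts that passed end to end
    k: attempt budget (k = 5 for pass@5)

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k attempts contains at least one passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a testbench given (n, c) pairs per query."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: three queries, five attempts each.
    demo = [(5, 0), (5, 2), (5, 5)]
    print(mean_pass_at_k(demo, k=5))
```

When exactly k attempts are sampled per query (n = k), the estimate reduces to 1 if any attempt passed and 0 otherwise, so the testbench score is simply the fraction of queries solved at least once.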


Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Mustafa, Akram, Naseem, Usman, Azghadi, Mostafa Rahimi

arXiv.org Artificial Intelligence

Background: Clinical coding, particularly the classification of hierarchical ICD-10 codes from unstructured discharge summaries, is essential for healthcare operations, but remains a labor-intensive and error-prone task. Automated approaches using Large Language Models (LLMs) offer the potential to augment or replace human coders, yet their reliability and reasoning capabilities, which are needed to ensure accurate, explainable code assignments, are not well understood. Objective: This study aims to benchmark a diverse set of LLMs, both reasoning and non-reasoning models, on their ability to classify hierarchical ICD-10 codes from discharge summaries and evaluate the effect of structured reasoning on model performance. Methods: Using the MIMIC-IV dataset, the study selected 1,500 discharge summaries labeled with the top 10 most frequent ICD-10 codes, balancing dataset size with the high computational and financial cost of using LLMs. We first preprocessed the data to extract clinically relevant tokens before feeding it to the LLMs. Specifically, we used cTAKES, a clinical NLP tool, to identify medical concepts. Each summary was encoded and submitted to 11 LLMs using a standardized, structured prompt simulating a clinical coder. Models were evaluated using the F1 score across three ICD-10 levels for both primary and all diagnoses classification tasks. Reasoning models on average outperformed non-reasoning models. The Gemini 2.5 Pro model demonstrated the highest performance across tasks.
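To make "F1 score across three ICD-10 levels" concrete, here is a small sketch of one way such a hierarchical evaluation could be set up. It assumes each level is obtained by truncating codes to a prefix (first character, three-character category, full code) and that F1 is micro-averaged over documents; the abstract states neither, and the names `truncate`, `micro_f1`, and `LEVELS` as well as the sample codes are hypothetical.

```python
from typing import Optional, Sequence

# Assumed prefix lengths per level; None means "keep the full code".
LEVELS: dict[str, Optional[int]] = {"level_1": 1, "level_2": 3, "level_3": None}


def truncate(code: str, length: Optional[int]) -> str:
    """Normalize an ICD-10 code and cut it to the given prefix length."""
    normalized = code.replace(".", "").upper()
    return normalized if length is None else normalized[:length]


def micro_f1(
    gold: Sequence[Sequence[str]],
    pred: Sequence[Sequence[str]],
    length: Optional[int],
) -> float:
    """Micro-averaged F1 over documents at one hierarchy level."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of documents")
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, pred):
        g = {truncate(c, length) for c in gold_codes}
        p = {truncate(c, length) for c in pred_codes}
        tp += len(g & p)  # codes predicted and present in the gold set
        fp += len(p - g)  # codes predicted but not in the gold set
        fn += len(g - p)  # gold codes the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up gold and predicted codes for two discharge summaries.
    gold = [["I10", "E11.9"], ["N17.9"]]
    pred = [["I10", "E11.65"], ["N18.9"]]
    for name, length in LEVELS.items():
        print(name, round(micro_f1(gold, pred, length), 3))
```

A prediction that is wrong at the full-code level can still count as correct at a coarser level, which is why scores typically fall as the level becomes more specific.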


Fine-grained Hierarchical Crop Type Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series: A Large-scale Dataset and Dual-stream Transformer Method

Li, Wenyuan, Liang, Shunlin, Zhang, Yuxiang, Liu, Liqin, Chen, Keyan, Chen, Yongzhe, Ma, Han, Xu, Jianglei, Ma, Yichuan, Guan, Shikang, Shi, Zhenwei

arXiv.org Artificial Intelligence

Fine-grained crop type classification serves as the fundamental basis for large-scale crop mapping and plays a vital role in ensuring food security. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities remains scarce because hyperspectral data are difficult to acquire and crop type annotation is costly. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification head with hierarchical fusion then simultaneously delivers multi-level crop type classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields an average F1-score improvement of 4.2% (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Code and dataset are available at https://github.com/flyakon/H2Crop.


Analyzing the Impact of AI Tools on Student Study Habits and Academic Performance

Ward, Ben, Bhati, Deepshikha, Neha, Fnu, Guercio, Angela

arXiv.org Artificial Intelligence

This study explores the effectiveness of AI tools in enhancing student learning, specifically in improving study habits, time management, and feedback mechanisms. The research focuses on how AI tools can support personalized learning and adaptive test adjustments and provide real-time classroom analysis. Student feedback revealed strong support for these features, and the study found a significant reduction in study hours alongside an increase in GPA, suggesting positive academic outcomes. Despite these benefits, challenges such as over-reliance on AI and difficulties in integrating AI with traditional teaching methods were also identified, emphasizing the need for AI tools to complement conventional educational strategies rather than replace them. Data were collected through a survey with a Likert scale and follow-up interviews, providing both quantitative and qualitative insights. The analysis involved descriptive statistics to summarize demographic data, AI usage patterns, and perceived effectiveness, as well as inferential statistics (t-tests, ANOVA) to examine the impact of demographic factors on AI adoption. Regression analysis identified predictors of AI adoption, and qualitative responses were thematically analyzed to understand students' perspectives on the future of AI in education. This mixed-methods approach provided a comprehensive view of AI's role in education and highlighted the importance of privacy, transparency, and continuous refinement of AI features to maximize their educational benefits.


Tokyo airport trials driverless cargo vehicle

The Japan Times

Tokyo's Haneda Airport is trialing a driverless vehicle to tow cargo containers in an attempt to get around labor shortages as the number of tourists flying into Japan soars. The vehicle at one of the world's busiest airports can tow up to 13 metric tons of containers, joint developers All Nippon Airways (ANA) and Toyota Industries said in a statement. It can pull up to six containers at a time, trundling between aircraft and airport buildings over a distance of around 2 kilometers with no driver in the cab. The vehicle is rated Level 4, meaning that it does not require human interaction in certain settings, although a human driver can still request control. It has been in operation since July 1. The trial, the first at a Japanese airport, is part of government-backed efforts to innovate the air transport industry, and the companies aim to make the vehicle fully operational by the end of next year, they said.


Self-driving gets off to bumpy start in Fukui town

The Japan Times

Technical and financial problems have been identified in the year since Japan's first transportation service using so-called Level 4 autonomous driving began in the town of Eiheiji, Fukui Prefecture. Amid the country's declining population, Level 4 autonomous driving, or driving that is fully automated under certain conditions, is viewed as a promising means of transport. The service in Eiheiji, however, has shown the hurdles that must be cleared. On May 28, 2023, the service was launched on a 2-kilometer section of a walkway in Eiheiji. It is available only on Saturdays, Sundays and national holidays.


The SAMER Arabic Text Simplification Corpus

Alhafni, Bashar, Hazim, Reem, Liberato, Juan Piñeros, Khalil, Muhamed Al, Habash, Nizar

arXiv.org Artificial Intelligence

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.


A new Taxonomy for Automated Driving: Structuring Applications based on their Operational Design Domain, Level of Automation and Automation Readiness

Betz, Johannes, Lutwitzi, Melina, Peters, Steven

arXiv.org Artificial Intelligence

The aim of this paper is to investigate the relationship between operational design domains (ODD), automated driving SAE Levels, and Technology Readiness Level (TRL). The first highly automated vehicles, like robotaxis, are in commercial use, and the first vehicles with highway pilot systems have been delivered to private customers. It has emerged as a crucial issue that these automated driving systems differ significantly in their ODD and in their technical maturity. Consequently, any approach to compare these systems is difficult and requires a deep dive into defined ODDs, specifications, and technologies used. Therefore, this paper challenges current state-of-the-art taxonomies and develops a new and integrated taxonomy that can structure automated vehicle systems more efficiently. We use the well-known SAE Levels 0-5 as the "level of responsibility", and link and describe the ODD at an intermediate level of abstraction. Finally, a new maturity model is explicitly proposed to improve the comparability of automated vehicles and driving functions. This method is then used to analyze today's existing automated vehicle applications, which are structured into the new taxonomy and rated by the new maturity levels. Our results indicate that this new taxonomy and maturity level model will help to differentiate automated vehicle systems in discussions more clearly and to discover white fields more systematically and upfront, e.g. for research but also for regulatory purposes.